INTRODUCTION

The noiseless data-compression algorithms introduced by Lempel and 
Ziv (6,7) parse an input data string into successive substrings, each
consisting of two parts: the citation, which is the longest prefix
that has appeared earlier in the input, and the innovation, which is 
the symbol immediately following the citation. In "extremal" 
versions of the LZ algorithm the citation may have begun anywhere in 
the input; in "incremental" versions it must have begun at a previous 
parse position. Originally the citation and the innovation were 
encoded, either individually or jointly, into an output word to be 
transmitted or stored. Subsequently, it has been speculated by
several authors (2 and others) that the cost of this encoding may be
excessively high because the innovation contributes roughly lg(A) 
bits, where A is the size of the input alphabet, regardless of the 
compressibility of the source. To remedy this excess, they suggested 
storing the parsed substring as usual, but encoding for output only 
the citation, leaving the innovation to be encoded as the first symbol 
of the next substring. Being thus included in the next substring, the 
innovation can participate in whatever compression that substring 
enjoys. We call this strategy deferred innovation. It is exemplified 
in the algorithm described by Welch and implemented in the C 
program compress that has widely displaced adaptive Huffman coding 
(compact) as a UNIX system utility. 

While compress achieves respectable compression ratios on highly 
compressible data (say two-to-one or better), it performs poorly, 
compared to theory and to other versions of LZ compression, on 
relatively incompressible data. In the extreme of total
incompressibility, such as uniform i.i.d. or well encrypted data, 
compress frequently expands the input by about 45% when the output 
word size is 12 bits and by about 90% when the output word size is 16, 
to mention two common options. 2 

These figures stand in contrast to LZ realizations without deferred 
innovation, where random data are expanded by about 5% for output 
words of 12 or more bits. The purpose of this paper is to explain the 
excessive expansion, and implicitly to warn against the use of 
deferred-innovation compressors on nearly incompressible data. 

Suppose a deferred-innovation LZ algorithm operates on a string of
b-bit input characters producing B-bit output words, 3 and assume that,
as in most implementations, the dictionary of citations is initialized
with all the individual symbols of the input alphabet. For an input
string "x y x z . . ." such an algorithm will output B bits for the
first "x" and store "xy"; then output B bits for "y" and store
"yx"; then output B bits for "x" and store "xz"; and so on. In
general, B bits will be output for every position that initiates a
novel pair, that is, a pair not seen earlier. What we shall show in
the next part of the paper is that if the input length is much less
than the square of the alphabet size, the "typical" string of length N
has almost N novel pairs, and therefore the output length must be
almost NB, and the compression ratio almost B/b. Now, when the input
is a string over the alphabet of 256 bytes, the input length would
have to be comparable to $2^{16}$ to avoid this condition; otherwise the
compression ratio will likely be close to 12/8 = 1.5 or 16/8 = 2.0 for
common choices of B. This is just the behavior mentioned above for 
the program compress. The "typical" string is generated by a uniform
independent source over the alphabet, or selected uniformly from
among all the possible strings of length N.

2 In default mode, compress refuses to recode a file doomed to
expansion.

3 While ideally this output word would be variable in length, it
is easy to show that not much is gained by the complication, so we
shall conform to practice and make the wordlength fixed.
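The parsing rule described above for the string "x y x z . . ." can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of compress: the function names, the unbounded dictionary, and the fixed output word size B are our own simplifications.

```python
import random

def deferred_innovation_parse(data, alphabet_size=256):
    """Return the dictionary codes emitted for `data`, a sequence of ints.

    Hypothetical helper: the dictionary starts with every single symbol and
    is never pruned, which is a simplification of real implementations.
    """
    dictionary = {(s,): s for s in range(alphabet_size)}
    codes = []
    phrase = ()
    for symbol in data:
        candidate = phrase + (symbol,)
        if candidate in dictionary:
            phrase = candidate                        # keep extending the citation
        else:
            codes.append(dictionary[phrase])          # emit the citation only
            dictionary[candidate] = len(dictionary)   # store citation + innovation
            phrase = (symbol,)                        # innovation opens the next substring
    if phrase:
        codes.append(dictionary[phrase])
    return codes

def expansion_ratio(n_symbols, b=8, B=12, seed=0):
    """Output bits divided by input bits for n_symbols uniform random b-bit symbols."""
    rng = random.Random(seed)
    data = [rng.randrange(2 ** b) for _ in range(n_symbols)]
    codes = deferred_innovation_parse(data, alphabet_size=2 ** b)
    return len(codes) * B / (n_symbols * b)

print(expansion_ratio(2000))
```

For an input much shorter than the square of the alphabet size, nearly every symbol costs one B-bit word, so the printed ratio should sit just under B/b = 1.5.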

We say an ordered pair of consecutive input symbols is novel if and 
only if the identical ordered pair has not appeared earlier in the 
input string. Let $S_i(A,N)$ be the number of strings of length N over
an alphabet of size A with a novel pair beginning at position i. The
pair beginning in position i can be repeating, like "xx", or
nonrepeating, like "xy". Thus $S_i(A,N)$ consists of two parts,
according to whether the novel pair at position i is repeating or not; 
see Figure 1. 


Clearly $S_1(A,N) = A^N$, and

$$S_i(A,N) = (A-1)\,A^{N-i}\,D(A,i-2) + (A-1)\,A^{N-i}\,E(A,i-1),$$

where D(A,i-2) counts strings of length i-2 containing no "xx", and
E(A,i-1) counts strings of length i-1 containing no "xy", while x and
y range over the alphabet. (We assume the indices are nonnegative and
take D(A,0) = E(A,0) = 1 by convention.)

Figure 2 shows, for the respective processes (or languages) that 
contain no repeating pair "xx" or no nonrepeating pair "xy", the 
state diagrams, adjacency matrices, characteristic equations, and the 
latters' roots. 
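The counts D(A,k) and E(A,k) can be checked by brute force for small alphabets. The sketch below is ours, and the two recurrences in it are an assumption on our part (they are the standard ones for avoiding a repeating or a nonrepeating pair, and are not taken from Figure 2); the enumeration serves to test them against the initial conditions.

```python
from itertools import product

def count_avoiding(A, k, pair):
    """Count strings of length k over range(A) in which `pair` never occurs
    as two consecutive symbols (exhaustive enumeration, small A and k only)."""
    return sum(
        all((s[j], s[j + 1]) != pair for j in range(k - 1))
        for s in product(range(A), repeat=k)
    )

def D_recurrence(A, k):
    """Strings with no "xx": assumed recurrence D_k = (A-1) * (D_{k-1} + D_{k-2})."""
    if k == 0:
        return 1
    prev, cur = 1, A
    for _ in range(k - 1):
        prev, cur = cur, (A - 1) * (cur + prev)
    return cur

def E_recurrence(A, k):
    """Strings with no "xy" (x != y): assumed recurrence E_k = A*E_{k-1} - E_{k-2}."""
    if k == 0:
        return 1
    prev, cur = 1, A
    for _ in range(k - 1):
        prev, cur = cur, A * cur - prev
    return cur

for k in range(5):
    print(k, count_avoiding(3, k, (0, 0)), count_avoiding(3, k, (0, 1)))
```

Both sequences start 1, A, A^2 - 1 and then diverge slowly, which is consistent with both dominant roots being close to A.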


FIGURE 1 (labels: repeating pair "xx" with prefix length i - 2; nonrepeating pair "xy" with prefix length i - 1; position i; suffix length N - i - 1)

FIGURE 2

From linear theory,

$$D(A,k) = d_+\delta_+^{k} + d_-\delta_-^{k} \quad\text{and}\quad E(A,k) = e_+\varepsilon_+^{k} + e_-\varepsilon_-^{k},$$

where $d_+$, $d_-$, $e_+$, and $e_-$ are constants determined by the initial
conditions:

$$D_0(A) = E_0(A) = 1, \qquad D_1(A) = E_1(A) = A.$$

In particular

$$\varepsilon_+ = A - A^{-1} - O(A^{-3}), \qquad \varepsilon_- = A^{-1} + O(A^{-3}),$$
$$e_+ = 1 - O(A^{-2}), \qquad\text{and}\qquad e_- = O(A^{-2}).$$

We can now estimate S(A,N), the total number of novel pairs among all
strings of length N, and $A^{-N}S(A,N)$, the average number of novel pairs
per string. We underestimate S(A,N) by ignoring D(A,N), which is
positive. Likewise, since $e_-$ and $\varepsilon_-$ are positive, we underestimate by
ignoring them. This leaves the approximation
$$\begin{aligned}
S(A,N) &> A^N + (A-1)A^{N-1}\sum_{i=2}^{N} A^{-i+1}E(A,i-1) \\
       &> A^N + (A-1)\sum_{j=1}^{N-1} e_+\varepsilon_+^{j}A^{N-j-1} \\
       &= A^N + (A-1)\,e_+\varepsilon_+\left[A^{N-2} + A^{N-3}\varepsilon_+ + \cdots + \varepsilon_+^{N-2}\right].
\end{aligned}$$

Multiplying and dividing by N-1, and using the fact that the
arithmetic mean dominates the geometric mean, we have

$$\begin{aligned}
S(A,N) &> A^N + (A-1)\,e_+\varepsilon_+\,(N-1)\,(A\varepsilon_+)^{(N-2)/2} \\
       &= \left(1 - O(A^{-1})\right)\left[A^N + (N-1)A^N\left(1 - 1/A^2\right)^{(N-2)/2}\right].
\end{aligned}$$

In the limit of large N this last expression is well approximated by

$$A^N + (N-1)A^N\exp\!\left(-\frac{N-2}{2A^2}\right).$$

Division by A N confirms the claim that the average number of novel 
pairs per string of length N remains about N until the string length 
exceeds the square of the alphabet size. 
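The claim can be illustrated empirically. The following sketch (our own; it counts pairs at positions 1 through N-1 of a single pseudo-random string, which is an assumption about the bookkeeping) compares the observed fraction of novel pairs with the approximation just derived.

```python
import random
from math import exp

def count_novel_pairs(symbols):
    """Number of positions whose ordered pair (s[i], s[i+1]) has not appeared earlier."""
    seen = set()
    novel = 0
    for pair in zip(symbols, symbols[1:]):
        if pair not in seen:
            seen.add(pair)
            novel += 1
    return novel

def novel_pair_fraction(A, N, seed=0):
    """Fraction of pair positions that are novel in one uniform random string."""
    rng = random.Random(seed)
    s = [rng.randrange(A) for _ in range(N)]
    return count_novel_pairs(s) / (N - 1)

def approximate_fraction(A, N):
    """The approximation above divided by N * A**N: (1 + (N-1) exp(-(N-2)/(2 A^2))) / N."""
    return (1 + (N - 1) * exp(-(N - 2) / (2 * A * A))) / N

for N in (2000, 65536, 524288):
    print(N, round(novel_pair_fraction(256, N), 3), round(approximate_fraction(256, N), 3))
```

With A = 256 the observed fraction should stay near 1 for N well below 65536 and fall off once N passes the square of the alphabet size; the approximation, being derived from an underestimate, should sit somewhat below the observed value in that range.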

A simpler but similar calculation can be used to estimate the expected 
number of novel singletons (symbols) in a string of length N. As
before, let $S_i(A,N)$ be the number of sequences in which the i-th
symbol is novel. Let that symbol be "x"; then the previous i-1
symbols may be anything but x, and the succeeding N-i symbols may be
anything at all. Thus

$$S_i(A,N) = (A-1)^{i-1}A^{N-i+1} \quad\text{and}\quad S(A,N) = \sum_{i=1}^{N} S_i(A,N).$$
As before, this can be underestimated by the geometric mean to give:

$$S(A,N) > AN\left(A^{(N-1)/2}(A-1)^{(N-1)/2}\right),$$

which is asymptotically

$$NA^N\exp\!\left(-\frac{N-1}{2A}\right).$$

Thus the average number of novel singletons (symbols) in a sequence of 
length N remains about N until N exceeds the alphabet size. Similar 
arguments may be used as well to show that almost all k-tuples will be
novel until the sequence length exceeds the k-th power of the alphabet
size.
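The singleton count is small enough to check exhaustively. In the sketch below (function names are ours) the closed-form sum for S(A,N) is compared with a direct enumeration, using the fact that the number of novel symbols in a string equals its number of distinct symbols, and the geometric-mean underestimate is evaluated alongside.

```python
from itertools import product

def novel_singletons_formula(A, N):
    """S(A,N) as the sum over i of (A-1)^(i-1) * A^(N-i+1)."""
    return sum((A - 1) ** (i - 1) * A ** (N - i + 1) for i in range(1, N + 1))

def novel_singletons_bruteforce(A, N):
    """Total number of first occurrences over all A^N strings (small cases only)."""
    return sum(len(set(s)) for s in product(range(A), repeat=N))

def geometric_mean_bound(A, N):
    """The underestimate N * A * (A*(A-1))^((N-1)/2)."""
    return N * A * (A * (A - 1)) ** ((N - 1) / 2)

A, N = 3, 5
print(novel_singletons_formula(A, N), novel_singletons_bruteforce(A, N),
      geometric_mean_bound(A, N))
```

For A = 3 and N = 5 both counts come to 633, against a geometric-mean underestimate of 540.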

DISTRIBUTION OF MEMORY CONTENTS 

We next consider the distribution of pairs, triples, and higher-order 
tuples in the LZ compressor memory during three regimes: while the
memory is filling but not yet full; when it has just filled; and when
it is full and in equilibrium. Our assumption is still that input 
symbols are selected uniformly and independently over some finite 
alphabet. Another assumption must be made, regarding possible 
deletions from the memory once it has filled. In practice a variety 
of deletion strategies have been used, notably l.r.u., whereby the 
least-recently used entry is deleted to make room for the newest 
insertion. In this paper we will usually make the simpler assumption 
that the entry to be deleted is chosen randomly from among the non- 
singletons. In other words, deletion is random except that the 
alphabetic symbols are immune. 

Memory Filling 

Initially the compressor memory (or dictionary) contains $\alpha$ singleton
entries, namely the symbols themselves. Each time a match to a
singleton is found a pair is inserted; should a pair be matched, its
extension to a triple is inserted, and so on. Given the uniform,
independent input assumption, it is clear that the likelihood of
matching a given pair is only $1/\alpha$ times the likelihood of matching a
singleton. Since we are interested mainly in large values of $\alpha$ like
32, 64, 128, 256, we will ignore the possibility of creating
quadruples or higher-order tuples, and lump them with the triples.
Thus the memory at any time contains $\alpha$ singles, $\beta$ pairs, and $\gamma$ others.
Let $\lambda$ be the total number of memory locations, and let $\mu = \lambda - \alpha$
be the number of locations available for pairs and higher-order
tuples. Then at the time of the t-th insertion we have

$$\beta + \gamma = t \ \text{ for } t < \mu, \qquad \beta + \gamma = \mu \ \text{ for } t \ge \mu.$$

The distribution of $\beta$ (hence $\gamma$) at time t during filling is given by
the formula

$$\Pr\{\beta \mid t\} = \alpha^{-t}\,S(t,\beta)\,\alpha^{(\beta)},$$

where $S(t,\beta)$ is a Stirling number of the second kind, and $\alpha^{(\beta)}$ is a
falling factorial of $\alpha$. The reason is that there are $S(t,\beta)$ ways of
choosing a sequence of t symbols which includes exactly $\beta$ distinct
symbols, and that the identities of those $\beta$ distinct symbols can be
chosen in exactly $\alpha^{(\beta)}$ ways. This count is then divided by $\alpha^{t}$, the
total number of ways of choosing a sequence of t symbols from the
alphabet. Since $\mu$ is a large number for any reasonable compressor, we
really need the asymptotic distribution in order to analyze the
possibly transient behavior when the memory has just filled, but we
don't know it at this time.
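For small parameters the formula can be evaluated exactly. The sketch below is ours and reads the formula as the probability that t uniform draws from an alphabet of size alpha contain exactly beta distinct values; the helper names are hypothetical, and exact rational arithmetic is used so that the probabilities can be summed without rounding.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(t, k):
    """Stirling number of the second kind S(t, k), by the usual recurrence."""
    if t == k:
        return 1                      # covers S(0, 0) = 1
    if k == 0 or k > t:
        return 0
    return k * stirling2(t - 1, k) + stirling2(t - 1, k - 1)

def falling_factorial(a, k):
    """a * (a - 1) * ... * (a - k + 1)."""
    result = 1
    for j in range(k):
        result *= a - j
    return result

def prob_distinct(alpha, t, beta):
    """alpha^(-t) * S(t, beta) * (falling factorial of alpha of order beta)."""
    return Fraction(stirling2(t, beta) * falling_factorial(alpha, beta), alpha ** t)

alpha, t = 4, 6
distribution = [prob_distinct(alpha, t, beta) for beta in range(t + 1)]
print(distribution, sum(distribution))
```

The probabilities sum to 1, and those for beta greater than alpha vanish because the falling factorial is then zero.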


Transient Period Under L.R.U. Deletion 


When the memory has just filled with pairs and higher-order tuples we 
speculate that there might be interaction between insertion and 
deletion by the l.r.u. rule that could cause temporary instability. 
In particular, if the memory size is close to $\alpha^2$ then most of the
earliest arrivals will have been pairs, and many of the recent 
arrivals will be triples, as a result of pairs having been matched and 
extended. This suggests a "gradient" from most recent to least recent 
shading from triples to pairs. In such a case, under l.r.u. deletion,
disproportionately more pairs will be deleted, abnormally increasing 
the proportion of non-pairs until the inability to match triples 
causes pairs to be recreated and reinserted. The alternation in 
proportions of pairs versus higher-order tuples would likely damp 
out. This transient behavior has not been confirmed, but is a topic 
of ongoing research. The distinction between this hypothesis and the 
equilibrium analyses below stems from the l.r.u. deletion policy, 
which makes critical not just the distribution of pairs and non-pairs, 
but their arrangement in memory as well. 

Equilibrium State and Distribution 

Finally we consider the distribution of memory contents and the 
compression ratio at equilibrium. We once again invoke the 
assumption of random deletion (contrary to the l.r.u. rule used in the 
previous, speculative section). First we solve for an equilibrium
state, that is, a ratio of pairs to non-pairs that is stable, and then
we generalize to an equilibrium distribution of probabilities of
ratios.

Equilibrium State 

Suppose that the memory is full, that it contains $\beta$ pairs (and thus
$\mu - \beta$ non-pairs), and that a randomly chosen input pair is read. The
probability that the input matches some pair in the memory is $\beta/\alpha^2$.
The probability that some pair (rather than a triple) is chosen for
deletion is $\beta/\mu$. Since these are independent events, the four joint
probabilities for the change in $\beta$ are:
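$$\Delta\beta = \left\{\begin{array}{rll}
+1 & \text{with probability } \dfrac{\alpha^2-\beta}{\alpha^2}\cdot\dfrac{\mu-\beta}{\mu} & \text{(gain a pair, lose a triple)} \\[2ex]
0 & \text{with probability } \dfrac{\alpha^2-\beta}{\alpha^2}\cdot\dfrac{\beta}{\mu} & \text{(gain a pair, lose a pair)} \\[2ex]
0 & \text{with probability } \dfrac{\beta}{\alpha^2}\cdot\dfrac{\mu-\beta}{\mu} & \text{(gain a triple, lose a triple)} \\[2ex]
-1 & \text{with probability } \dfrac{\beta}{\alpha^2}\cdot\dfrac{\beta}{\mu} & \text{(gain a triple, lose a pair)}
\end{array}\right.$$
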

At equilibrium the first and last probabilities must be equal, and we
can solve for $\beta$. (Recall that we are ignoring the creation of
quadruples or higher orders.)

$$\beta = \frac{\alpha^2\mu}{\alpha^2+\mu}; \qquad \gamma = \mu - \beta = \frac{\mu^2}{\alpha^2+\mu}.$$

Using these values we can estimate the compression ratio achieved in
this equilibrium state by compress, which will output B bits (12 by
default) for each 2b bits when a pair can be matched, and will output B
bits for only b bits when a pair cannot be matched. The ratio at
equilibrium is thus

$$\rho_{eq} = \frac{(\alpha^2-\beta)B + \beta B}{(\alpha^2-\beta)b + 2\beta b} = \frac{\alpha^2 B}{(\alpha^2+\beta)b} = \frac{(\alpha^2+\mu)\lg\mu}{(\alpha^2+2\mu)\lg\alpha} + O(1/\alpha).$$

From this expression we would expect compress in default mode (with
b=8, B=12) to have $\rho_{eq} = 1.42$, which is quite close to experience.
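The arithmetic behind this figure can be reproduced directly. The sketch below is ours and assumes that the alphabet has 2^b symbols and that the memory has 2^B locations, so that the space left for pairs and longer tuples is 2^B - 2^b; the function names are hypothetical.

```python
from math import isclose

def equilibrium_counts(alpha, lam):
    """Equilibrium numbers of pairs and of non-pairs under random deletion."""
    mu = lam - alpha                      # locations left for non-singletons
    a2 = alpha ** 2
    beta = a2 * mu / (a2 + mu)            # pairs
    gamma = mu - beta                     # triples and higher, lumped together
    return beta, gamma

def equilibrium_ratio(b=8, B=12):
    """Output bits per input bit at equilibrium: alpha^2 * B / ((alpha^2 + beta) * b)."""
    alpha, lam = 2 ** b, 2 ** B
    beta, _ = equilibrium_counts(alpha, lam)
    return (alpha ** 2 * B) / ((alpha ** 2 + beta) * b)

beta, gamma = equilibrium_counts(256, 4096)
print(round(beta, 1), round(gamma, 1), round(equilibrium_ratio(), 2))
```

With b = 8 and B = 12 the memory holds roughly 3627 pairs out of 3840 available locations, and the ratio rounds to 1.42.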
Equilibrium Distribution


We consider next the equilibrium distribution governing the number of 
pairs present in the compressor memory. Again we assume that the 
probabilities of creating quadruples, quintuples, and so on are 
negligible, so that they can be lumped together with the triples. As 
usual, the singletons are permanent memory residents. 

With memory size $\mu$ consider the random variable $\beta$ describing the
number of pairs present. $\beta$ ranges from 0, when the memory has no
pairs, to $\min(\mu,\alpha^2)$, when either the memory is full of pairs or all
pairs are present. As each parse of the uniform, independent input is
made, $\beta$ may increase by 1 or decrease by 1 (except at the extremes) or
stay the same, with the respective probabilities given under
Equilibrium State above. 4 This gives us a Markov process with transition matrix


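$$T_{ij} = \begin{cases}
\dfrac{i^2}{\alpha^2\mu} & j = i-1, \\[2ex]
\dfrac{i}{\alpha^2} + \dfrac{i}{\mu} - \dfrac{2i^2}{\alpha^2\mu} & j = i, \\[2ex]
\left(1 - \dfrac{i}{\alpha^2}\right)\left(1 - \dfrac{i}{\mu}\right) & j = i+1, \\[2ex]
0 & \text{otherwise.}
\end{cases}$$
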


Because this is a connected Markov process, it has an equilibrium
distribution $p_i$ which satisfies $pT = p$, or $(pT)_i = p_i$. We show in the
Appendix that

$$p_i = \frac{\dbinom{\alpha^2}{i}\dbinom{\mu}{i}}{\displaystyle\sum_{k=0}^{\min(\alpha^2,\mu)}\dbinom{\alpha^2}{k}\dbinom{\mu}{k}}.$$

4 This distribution was erroneously described in the Snowbird
talk as an Ehrenfest model (1). As we shall see, it is rather like a
componentwise product of two Ehrenfest processes.


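For a very simple example, let $\alpha^2 = 4$, $\mu = 5$. Then

$$T = \frac{1}{20}\begin{pmatrix}
0 & 20 & 0 & 0 & 0 \\
1 & 7 & 12 & 0 & 0 \\
0 & 4 & 10 & 6 & 0 \\
0 & 0 & 9 & 9 & 2 \\
0 & 0 & 0 & 16 & 4
\end{pmatrix}$$

and, with $\times$ denoting the componentwise product followed by renormalization,

$$p(i) = \tfrac{1}{16}(1,4,6,4,1) \times \tfrac{1}{31}(1,5,10,10,5) = \tfrac{1}{126}(1,20,60,40,5).$$

This means that asymptotically the distribution is the product of two
Gaussian distributions, with relatively displaced means unless $\alpha^2 = \mu$.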
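The small example lends itself to an exact check. The sketch below (ours; the function names are hypothetical) builds the transition matrix from the four joint probabilities, forms the product-of-binomials distribution, and applies one step of the chain in exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def transition_matrix(a2, mu):
    """T[i][j] for the number of pairs, with a2 = alpha^2 possible pairs and mu slots."""
    n = min(a2, mu)
    T = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        if i > 0:
            T[i][i - 1] = Fraction(i * i, a2 * mu)                   # match a pair, delete a pair
        T[i][i] = Fraction(i, a2) + Fraction(i, mu) - Fraction(2 * i * i, a2 * mu)
        if i < n:
            T[i][i + 1] = (1 - Fraction(i, a2)) * (1 - Fraction(i, mu))  # new pair, delete a non-pair
    return T

def binomial_product_distribution(a2, mu):
    """Normalized weights C(a2, i) * C(mu, i) for i = 0 .. min(a2, mu)."""
    weights = [comb(a2, i) * comb(mu, i) for i in range(min(a2, mu) + 1)]
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def step(p, T):
    """One application of the chain: the row vector p times T."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

T = transition_matrix(4, 5)
p = binomial_product_distribution(4, 5)
print([str(x) for x in p])
print(step(p, T) == p)
```

Because every quantity is a Fraction, the final comparison is an exact test of stationarity rather than a floating-point approximation.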
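APPENDIX

It suffices to deal with the unnormalized weights $q_i = \binom{\alpha^2}{i}\binom{\mu}{i}$ and to show that

$$(qT)_i = q_{i-1}T_{i-1,i} + q_iT_{i,i} + q_{i+1}T_{i+1,i} = \binom{\alpha^2}{i}\binom{\mu}{i}.$$

$$\begin{aligned}
(qT)_i &= \binom{\alpha^2}{i-1}\binom{\mu}{i-1}\left(1 - \frac{i-1}{\alpha^2}\right)\left(1 - \frac{i-1}{\mu}\right)
        + \binom{\alpha^2}{i}\binom{\mu}{i}\left(\frac{i}{\alpha^2} + \frac{i}{\mu} - \frac{2i^2}{\alpha^2\mu}\right)
        + \binom{\alpha^2}{i+1}\binom{\mu}{i+1}\frac{(i+1)^2}{\alpha^2\mu} \\
       &= \frac{1}{\alpha^2\mu}\left[\binom{\alpha^2}{i-1}\binom{\mu}{i-1}(\alpha^2 - i + 1)(\mu - i + 1)
        + \binom{\alpha^2}{i}\binom{\mu}{i}(\alpha^2 i + \mu i - 2i^2)
        + \binom{\alpha^2}{i+1}\binom{\mu}{i+1}(i+1)^2\right] \\
       &= \frac{1}{\alpha^2\mu}\binom{\alpha^2}{i}\binom{\mu}{i}\left[i^2 + \alpha^2 i + \mu i - 2i^2 + (\alpha^2 - i)(\mu - i)\right] \\
       &= \binom{\alpha^2}{i}\binom{\mu}{i} = q_i.
\end{aligned}$$

Q.E.D.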